Step 02 Data Wrangling¶

Introduction¶

In the previous step 01 Problem statement, we have defined the goal of this project and just have downloaded hte raw data.

In this step 02 Data Wrangling, we clean up the data as much as we can to make it ready for the next step EDA(Exploratory Data Analysis).

The major tasks will be to check the data type used, missing values, and transform the style as necessary.

Imports¶

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import clear_output

import os
from library.sb_utils import save_file

Load data¶

dfm = pd.read_csv('../data/step1_dfm.csv')
dfr = pd.read_csv('../data/step1_dfr.csv',parse_dates = ['reviewTime'])

# convert dataframe values to string
dfm = dfm.astype(str)

# remove the html tags
# http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/
def remove_html_tags(raw_html):
    from bs4 import BeautifulSoup
    cleantext = BeautifulSoup(raw_html, "lxml").text
    return cleantext

check duplicates¶

# Regarding meta data, check for duplicates
dfm.duplicated().sum(),dfr.duplicated().sum()

(21927, 243442)

check null¶

# Any null entries
dfm.isnull().values.any(),dfm.isnull().values.any()

(False, False)

There seems to be no missing data, but the current missing values are filled with empty, which saves time in the following process. So, we go ahead as they are now.

meta data¶

dfm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   category         203766 non-null  object
 1   tech1            203766 non-null  object
 2   description      203766 non-null  object
 3   fit              203766 non-null  object
 4   title            203766 non-null  object
 5   also_buy         203766 non-null  object
 6   tech2            203766 non-null  object
 7   brand            203766 non-null  object
 8   feature          203766 non-null  object
 9   rank             203766 non-null  object
 10  also_view        203766 non-null  object
 11  main_cat         203766 non-null  object
 12  similar_item     203766 non-null  object
 13  date             203766 non-null  object
 14  price            203766 non-null  object
 15  asin             203766 non-null  object
 16  imageURL         203766 non-null  object
 17  imageURLHighRes  203766 non-null  object
 18  details          203766 non-null  object
dtypes: object(19)
memory usage: 29.5+ MB

category column¶

dfm.category.value_counts().to_frame()

It is found that the data in category has some levels although the first level, 'Movies & TV', will not be necessary

# check what is in 'Special Interests' category
dfm[dfm.category.str.contains('Special Interests')].category.unique()

array(["['Movies & TV', 'Genre for Featured Categories', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Art & Artists']",
       "['Movies & TV', 'Independently Distributed', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'History']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'General']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Nature & Wildlife']",
       "['Movies & TV', 'Studio Specials', 'Warner Home Video', 'Warner Video Bargains', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Travel']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Fitness']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Health']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Religion & Spirituality']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Yoga']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Dance']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Cooking & Beverages']",
       "['Movies & TV', 'Genre for Featured Categories', 'Special Interests', 'Manufactured to the Highest Quality Available.', 'With True Enhanced Performance.', 'Latest Technical Development.']",
       '[\'Movies & TV\', \'Genre for Featured Categories\', \'Special Interests\', "In this video you will see demonstrations from a group of Hollywood\'s finest artists as they create a prosthetic makeup including false teeth and hair Also includes an interview and career profile of makeup legend Tom Burman.", \'While Supplies Last As always\']'],
      dtype=object)

I thought that the parents of 'Special Interests' is identical, but they are different and there is no consistency to avoid duplicated naming in the category structure.

# Convert the list-like string to list
from ast import literal_eval
dfm.category = dfm.category.apply(lambda x: literal_eval(x))

max_cat = dfm.category.apply(len).max()
max_cat

7

The maxium #level of categories is 7.

dfr.overall.value_counts(dropna=False)

5.0    5491586
4.0    1498047
3.0     735907
1.0     633153
2.0     406875
Name: overall, dtype: int64

dfr[dfr.overall=='']

# for i in range(max_cat):
#     dfm[str('cat'+str(i+1))] = dfm['category'].apply(lambda x: x[i] if i<len(x) else '')

# remove 'Movies & TV' as this meta and review dataframe is only ordinally from 'Movies & TV' category, and it is unnecessary
dfm['category'].apply(lambda x : x.remove("Movies & TV") if "Movies & TV" in x else x)

# print the unique values and its count
categories = dfm.category.value_counts().to_frame()

categories = categories.reset_index()

categories.columns = ['cat','num_cat']
categories.cat = categories.cat.apply(str)

# Number of movies per category
sns.catplot(data = categories.iloc[:10], x = 'num_cat', y = 'cat', kind = 'bar', orient= 'h', color="skyblue");

# stock = []

# for i in range(len(categories)):
#     stock.append(dfm.iloc[i].category[0])
# set(stock)

main_cat¶

dfm.main_cat.value_counts().to_frame()

dfm[dfm['main_cat']!='Movies & TV'].head()

They are unnecessary.

dfm = dfm[dfm['main_cat']=='Movies & TV']

description column¶

# Check the description
dfm.description.value_counts().to_frame().tail(2)

# check the data in the row of 66812.
dfm.iloc[66812].description

'[]'

Clean in the same way as above

# Fill empty categories with empty string
dfm['description'] = dfm['description'].apply(lambda x: '' if x == '[]' else x)

# Clean html letter code
dfm['description'] = dfm[['description']].applymap(remove_html_tags)

# confirm the result
dfm.iloc[66812].description

''

# check other rows
dfm[dfm.description.str.contains('///')].iloc[3].description

'[\'Journey in China  West Lake Hangzhou DVD Language: Chinese and English Subtitle: English-Chinese Bilingual Code Are: All Regions 16:9 NTSC ISBN: 9787884200153 Introduction: West Lake is located in the western area of Hangzhou City\\\'s historic center. There are dozens of lakes called West Lake worldwide, but "West Lake" usually refers to the Hangzhou West Lake. It is surrounded by mountains on three sides, with an area of around 6.5 square kilometers. The circumference is around 15 kilometers. The average depth of West Lake is 2.27 meters, and the capacity is about 14,290,000 cubic meters. The lake is divided by Gu Shan, Bai, Su and Yanggong Causeways into five areas. Ordered by their areas, they are Outer West Lake (), West Inner Lake (, or , or ), North Inner Lake ( or ), Little South Lake ( or ) and Yue Lake (). "Outer West Lake" is the largest. "Gu Shan" or Gu Hill is the largest natural island in the lake. Su & Bai Causeways run cross the lake. Three small man-made islands, "Xiao Ying Zhou" (), "Hu Xing Ting" (), "Ruan Gong Dun" (), lie in the center of Outer West Lake. Thus, the basic layout is "one hill, two causeways, three islands, and five lakes". /// This is a great product sourced from BIML - Bible In My Language, the leader in foreign language Bibles and outreach materials from Baltimore, Maryland in the USA. BIML stocks Bibles in more than 600 languages.\']'

We still have unnecessary characters such as the \\ or /// but now just leave them.

?? remove the \\ or ///

title column¶

dfm.title.value_counts().to_frame()

# Clean html letter code
dfm['title'] = dfm[['title']].applymap(remove_html_tags)

# remove any leading and trailing spaces
dfm['title'] = dfm[['title']].apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# replace nan for ''
dfm['title'] = dfm[['title']].replace('nan', 'no-title')

dfm.title.value_counts().to_frame()

brand column¶

dfm.brand.value_counts().to_frame()

# Clean html letter code
dfm['brand'] = dfm[['brand']].applymap(remove_html_tags)

# remove any leading and trailing spaces
dfm['brand'] = dfm[['brand']].apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# replace specific values for ''
dfm['brand'] = dfm[['brand']].replace(['nan','.','-','*','','None'], 'no-brand')

dfm.brand.value_counts().to_frame()

asin¶

print('The number of products: {}'.format(dfm.asin.nunique()))

The number of products: 181552

review data¶

dfr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 4 columns):
 #   Column      Dtype         
---  ------      -----         
 0   overall     float64       
 1   reviewTime  datetime64[ns]
 2   reviewerID  object        
 3   asin        object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 267.5+ MB

overall¶

dfr[['overall','reviewTime']].describe().T

reviewerID¶

print('The number of reviewer in review data: {}'.format(dfr.reviewerID.nunique()))

The number of reviewer in review data: 3826085

asis¶

print('The number of products in review data: {}'.format(dfr.asin.nunique()))

The number of products in review data: 182032

Delete unnecessary data¶

# Delete the rows that only exist in the either review or meta data
dfm = dfm.astype(str).drop_duplicates()
dfr = dfr.astype(str).drop_duplicates()

Save data ¶

This should be done aftre imputation of communityAverage as there are null values in the column.

# save the data to a new csv file
datapath = '../data'
save_file(dfm, 'step2_cleaned_dfm.csv', datapath)
save_file(dfr, 'step2_cleaned_dfr.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step2_cleaned_dfm.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step2_cleaned_dfr.csv"

Summary ¶

The key processes and findings from this notebook are as follows.

Rows deleted: The rows(asin in the meta data) have been deleted.
Columnns deleted: 0 column
meta data
- Category column: Some data which seem unnecessary such as 'Movies or TV' have been removed (possibly they could have been splited into separate columns of 'Movies or TV' and 'Featured genre') It was expected to visualise the structure of category usig node to clarify the structure, but it takes time and not the main purpose here, so category data was remained as it is.
- Description column: html script has been removed, but some symbols such as '\\' or '///' still remains. Those unnecessary symbols will be removed as necessary in the future steps.

	category
['Movies & TV', 'Movies']	30831
['Movies & TV', 'Genre for Featured Categories', 'Action & Adventure']	10761
['Movies & TV', 'Genre for Featured Categories', 'Drama']	9639
['Movies & TV', 'Genre for Featured Categories', 'Documentary']	8843
['Movies & TV', 'Genre for Featured Categories', 'Kids & Family']	8029
...	...
['Movies & TV', 'Charlie Rose Store', 'Entertainment', 'Television']	1
['Movies & TV', 'Independently Distributed', 'Documentary', 'Made in USA', 'Woven Badge', 'Ribbed back neck and cuffs', 'Printed front with plain back']	1
['Movies & TV', 'Studio Specials', 'Sony Pictures Home Entertainment']	1
['Movies & TV', 'African American Cinema', 'Breakthrough Cinema']	1
['Movies & TV', 'Fully Loaded DVDs', 'Five Star Collection']	1

	main_cat
Movies & TV	203464
Sports & Outdoors	118
Books	35
Amazon Home	33
<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>	31
Toys & Games	17
All Electronics	10
nan	10
Cell Phones & Accessories	9
Industrial & Scientific	8
Health & Personal Care	7
<img src="https://images-na.ssl-images-amazon.com/images/G/01/nav2/images/gui/amazon-fashion-store-new._CB520838675_.png" class="nav-categ-image" alt="AMAZON FASHION"/>	6
All Beauty	3
Arts, Crafts & Sewing	3
Office Products	2
Pet Supplies	2
Tools & Home Improvement	2
Musical Instruments	1
Home Audio & Theater	1
Camera & Photo	1
Grocery	1
Baby	1
Video Games	1

	title
nan	59
Treasure Island	15
Live	11
Mozart: Don Giovanni	10
WWE	10
...	...
Revolution (2012) ( (R)evolution ) [ Blu-Ray, Reg.A/B/C Import - Germany ]	1
Dido & Aeneas: A Choreographic Opera	1
All About John Deere For Kids VHS#4	1
Believe: The Eddie Izzard Story 2010	1
No Man's Land / Ronin / Harley Davidson and the Marlboro Man / Blown Away	1

	title
no-title	59
Treasure Island	15
Live	11
Mozart: Don Giovanni	10
WWE	10
...	...
The Goon Show Old Time Radio OTR Mp3 Collection on DVD - Offering 82 episodes with 38 hours of entertainment	1
Remembering September 11th	1
Boy of Two Worlds	1
Girl Who Leapt Through Space 2	1
No Man's Land / Ronin / Harley Davidson and the Marlboro Man / Blown Away	1

	brand
nan	66132
Various	3178
.	1041
\n	434
-	427
...	...
Paola Mendoza	1
Narration by Heather Ryan	1
Jzsef Madaras	1
Kato Buss	1
Grant Withers Frankie Darro	1

	category	tech1	description	fit	title	also_buy	tech2	brand	feature	rank	also_view	main_cat	similar_item	date	price	asin	imageURL	imageURLHighRes	details
793	[Christian Video, Biography]	nan	['When Superbook falls onto a computer keyboar...	nan	Moses & Plagues/Moses & Israelites (Su...	[]	nan	Tyndale	[]	5,869,236 in Books (	[]	Books	nan	nan	nan	084236837X	[]	[]	nan
794	[Christian Video, Bible]	nan	['Two inquisitive children stumble upon an anc...	nan	Nehemiah & the Wall / Job (Superbook Video...	[]	nan	Visit Amazon's Tyndale Page	[]	8,536,790 in Books (	['B00N37I5R2']	Books	nan	nan	nan	0842368167	[]	[]	nan
898	[Musicals & Performing Arts]	nan	["Learn how to play and practice the classic r...	nan	Basic Afro-Cuban Rhythms for Drum Set and Hand...	[]	nan	nan	[]	3,758,302 in Books (	[]	Books	nan	nan	$14.62	0876390300	[]	[]	nan
1274	[Movies]	nan	["The Family Video can be enjoyed as pure ente...	nan	Family	[]	nan	Memory Lane	[]	6,710,279 in Books (	[]	Books	nan	nan	$5.49	0976224976	[]	[]	nan
1498	[TV]	nan	["Beverly Hills 90210 - The Complete First Sea...	nan	Beverly Hills 90210 - The First Season [DVD]	[]	nan	nan	[]	2,550,757 in Beauty & Personal Care (	[]	All Beauty	nan	nan	nan	1415717214	[]	[]	nan

	description
['WITCHOUSE']	1
["Marco's latest release is his most detailed yet. The DVD includes almost one full hour of drum solos, performance with the Marco Minnemann big band and guitar hero Mike Kneally, several play-along tracks, instructional footage, comprehensive bonus feature"]	1

	brand
no-brand	68695
Various	3178
Learn more	358
Sinister Cinema	265
John Wayne	233
...	...
Narration by Heather Ryan	1
Jzsef Madaras	1
Kato Buss	1
Ty Burrell	1
Grant Withers Frankie Darro	1